Object clustering using inter-layer links

ABSTRACT

One aspect relates to clustering a group of objects of a first object type based on a relative importance using links extending between objects of the first object type and certain objects of a second object type. In one embodiment, the first object type is a Web page object type and the second object type is a user object type.

TECHNICAL FIELD

[0001] This disclosure relates to clustering, and more particularly to clustering heterogeneous objects (certain embodiments are also referred to herein as clustering of multiple types of inter-related objects).

BACKGROUND

[0002] Clustering involves grouping of multiple objects, and is used in such applications as search engines and information mining. Clustering algorithms group objects based on the similarities of the objects. For instance, Web page objects are clustered based on their content, link structure, or their user access logs. The clustering of users is based on the items they have selected. User objects are clustered based on their access history. Clustering of items associated with the users is traditionally based on the users who selected those items. A variety of clustering algorithms are known. Prior-art clustering algorithms include partitioning-based clustering, hierarchical clustering, and density-based clustering.

[0003] The content of users' accessed Web pages or access patterns are often used to build user profiles to cluster Web users. Traditional clustering techniques are then employed. In collaborative filtering, clustering is also used to group users or items for better recommendation/prediction.

[0004] Use of these prior clustering algorithms, in general, has certain limitations. Traditional clustering techniques can face the problem of data sparseness, in which the number of objects, or the number of links between heterogeneous objects, is too small to achieve effective clustering of objects. With homogenous clustering, the data set being analyzed contains the same type of objects. For example, if the homogenous clustering is based on a Web page and a user, then the Web page objects and the user objects will each be clustered separately. If the homogenous clustering is based on an item and a user, then the item objects and the user objects will each be clustered separately. In such homogenous clustering embodiments, those objects of the same type are clustered together without consideration of other types of objects.

[0005] Prior-art heterogeneous object clustering clusters the object sets separately. The heterogeneous object clustering uses the links only as flat features representing each object node. In prior-art heterogeneous clustering, the overall link structure inside and between the layers is not considered, or alternatively is simply treated as separate features.

SUMMARY

[0006] This disclosure relates generally to clustering. One aspect of the disclosure relates to clustering a group of objects using links extending between certain objects of a first object type and certain objects of a second object type. In one embodiment, the first object type is a Web page object type and the second object type is a user object type.

BRIEF DESCRIPTION OF THE DRAWINGS

[0007] Throughout the drawings, the same numbers reference like features and components.

[0008] FIG. 1 is a block diagram of one embodiment of a computer environment that can be used for clustering;

[0009] FIG. 2 is a block diagram of one embodiment of a framework for clustering heterogeneous objects;

[0010] FIG. 3 is a block diagram of one embodiment of a hybrid net model;

[0011] FIG. 4 is a block diagram of another embodiment of a computer environment that is directed to the Internet;

[0012] FIGS. 5a and 5b are a flow chart of one embodiment of a clustering algorithm;

[0013] FIG. 6 is a block diagram of another embodiment of a framework for clustering heterogeneous objects that includes a hidden layer;

[0014] FIG. 7 is a flow chart of another embodiment of a clustering algorithm; and

[0015] FIG. 8 illustrates a block diagram of one embodiment of a computer environment that can perform the clustering techniques as described in this disclosure.

DETAILED DESCRIPTION

[0016] The use of computers and networked environments, such as the Internet, continues to expand rapidly. Search tools are one important class of tool that allows larger networks, such as the Internet, to be used effectively. Clustering of objects stores and retrieves those objects that are of similar types. Clustering is used in many aspects of search tools to permit a user to retrieve applicable data from such memory locations as a database without also retrieving a lot of marginally significant data. Certain aspects of this disclosure make clustering more applicable to retrieving and storing large sets of data objects. One embodiment of the present disclosure provides for clustering of heterogeneous objects. Clustering of heterogeneous objects is also considered clustering of multiple types of inter-related objects.

[0017] One embodiment of a computer environment 20 (a general purpose computer) that can benefit from the use of clustering is shown in FIG. 1. The computer environment 20 includes a memory 22, a processor 24, a clustering portion 28, and support circuits 26. The support circuits include such devices as a display and an input/output circuit portion that allow the distinct components of the computer environment 20 to transfer information (i.e., data objects).

[0018] Clustering is performed within the clustering portion 28. The clustering portion 28 can be integrated within the memory 22 and the processor 24 portions of the computer environment. For example, the processor 24 processes the clustering algorithm (which is retrieved from memory) that clusters the different objects. The memory 22 (such as databases) is responsible for storing the clustered objects and the associated programs and clustering algorithms so that the clustered objects can be retrieved (and stored) as necessary. The computer environment 20 may be configured as a stand-alone computer, a networked computer system, a mainframe, or any of the variety of computer systems that are known. Certain embodiments disclosed herein describe a computer environment application (a computer downloading Web pages from the Internet). It is envisioned that the concepts described herein are applicable to any known type of computer environment 20.

[0019] This disclosure provides a clustering mechanism by which the percentage of the returned results that are considered reliable (i.e., are applicable to the user's query) is increased. Clustering can be applied to such technical areas as search tools, information mining, data mining, collaborative filtering, etc. Search tools have received attention because of their capabilities to serve different information needs and achieve improved retrieval performance. Search tools are associated with such computer aspects as Web pages, users, queries, etc.

[0020] The present disclosure describes a variety of clustering algorithm embodiments for clustering data objects. Clustering of data objects is a technique by which large sets of data objects are divided into a larger number of smaller sets, or clusters, of data objects (each cluster having fewer data objects than the original set). The data objects contained within a clustered group of data objects have some similarity to one another. One aspect of clustering therefore can be considered as grouping of multiple data objects.

[0021] One clustering mechanism described in this disclosure relates to a framework graph 150, one embodiment of which is illustrated in FIG. 2. Certain embodiments of a unified clustering mechanism are provided in which different types of objects are clustered between different levels or node sets P and U as shown in the framework graph 150 of FIG. 2. It is also envisioned that the concepts described in this disclosure can be applied to three or more layers, instead of the two layers as described in the disclosure. Each node set P and U may also be considered a layer. In this disclosure, the term “unified” clustering applies to a technique for clustering heterogeneous data. The node set P includes a plurality of data objects p₁, p₂, p₃, . . . , p_(i) that are each of a similar data type. The node set U includes a plurality of data objects u₁, u₂, u₃, . . . , u_(j) that are each of a similar data type. The data type of the objects clustered on each node set (P or U) is identical, and therefore the data objects in each node set (P or U) are homogenous. The type of the data objects p₁, p₂, p₃, . . . , p_(i) that are in the node set P is different from the type of the data objects u₁, u₂, u₃, . . . , u_(j) that are in the node set U. As such, the types of data objects that are in different ones of the node sets P and U are different, or heterogeneous. Certain aspects of this disclosure provide for clustering using inputs (based on links) from homogenous and heterogeneous data types of objects.

[0022] Links are illustrated in this disclosure by lines extending between a pair of data objects. Links represent the relationships between pairs of data objects in clustering. In one instance, a link may extend from a Web page object to a user object, and represent the user selecting certain Web pages. In another instance, a link may extend from a Web page object to another Web page object, and represent relations between different Web pages. In certain embodiments of clustering, the “links” are referred to as “edges”. The generalized term “link” is used in this disclosure to describe links, edges, or any connector of one object to another object that describes a relationship between the objects.

[0023] There are a variety of different types of links (as described in this disclosure) that relate to clustering different types of objects and that associate different ones of the objects as set forth in the framework graph 150. The links can be classified as either inter-layer links or intra-layer links. An intra-layer link 203 or 205 is one embodiment of a link within the framework graph 150 that describes relationships between different objects of the same type. An inter-layer link 204 is one embodiment of a link within the framework graph 150 that describes relationships between objects of different types. As shown in FIG. 2, there are a plurality of intra-layer links 203 extending between certain ones of the data objects u₁, u₂, u₃, . . . , u_(j). In the embodiment shown in FIG. 2, there are also a plurality of intra-layer links 205 extending between certain ones of the data objects p₁, p₂, p₃, . . . , p_(i). In the embodiment shown in FIG. 2, there are also a plurality of inter-layer links 204 extending between certain ones of the data objects u₁, u₂, u₃, . . . , u_(j) in the node set U and certain ones of the data objects p₁, p₂, p₃, . . . , p_(i) in the node set P. Using inter-layer links recognizes that clustering of one type of object may be affected by another type of object. For instance, clustering of Web page objects may be affected by user object configurations, state, and characteristics.

[0024] The link directions (as provided by the arrowheads for the links 203, 204, or 205 in FIG. 2, and also in FIG. 3) are illustrated as bi-directional since the relationships between the data objects may be directed in either direction. The links are considered illustrative and not limiting in scope. Certain links in the framework graph 150 may be more appropriately directed in one direction; however, the direction of the arrowhead typically does not affect the framework's operation. The framework graph 150 is composed of node set P, node set U, and link set L. Within the framework graph 150, p_(i) and u_(j) represent two types of data objects, in which p_(i) ∈ P (i=1, . . . , I) and u_(j) ∈ U (j=1, . . . , J). I and J are the cardinalities of the node sets P and U, respectively.

[0025] Links (p_(i), u_(j)) ∈ L are inter-layer links (which are configured as 2-tuples) that are illustrated by reference character 204 between different types of objects. Links (p_(i), p_(j)) ∈ L and (u_(i), u_(j)) ∈ L, that are referenced by 205 and 203, respectively, are intra-layer links that extend between the same type of object. For simplicity, different reference characters are applied for inter-layer link sets (204) and intra-layer link sets (203, 205).
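
By way of illustration only, the node sets P and U and the link set L of 2-tuples described above could be held in a small Python structure such as the following. This is a minimal sketch; the class name `FrameworkGraph`, its methods, and the node labels are hypothetical and do not appear in this disclosure.

```python
from dataclasses import dataclass, field


@dataclass
class FrameworkGraph:
    """Two-layer graph with node set P, node set U, and link set L (hypothetical helper)."""
    pages: set                                # node set P
    users: set                                # node set U
    links: set = field(default_factory=set)   # link set L, stored as 2-tuples

    def add_link(self, a, b):
        self.links.add((a, b))

    def link_kind(self, a, b):
        """Return 'intra-layer' if both nodes are in the same layer, else 'inter-layer'."""
        same_layer = (a in self.pages) == (b in self.pages)
        return "intra-layer" if same_layer else "inter-layer"


graph = FrameworkGraph(pages={"p1", "p2"}, users={"u1", "u2"})
graph.add_link("p1", "u1")   # corresponds to an inter-layer link 204
graph.add_link("p1", "p2")   # corresponds to an intra-layer link 205
graph.add_link("u1", "u2")   # corresponds to an intra-layer link 203
kinds = {link: graph.link_kind(*link) for link in graph.links}
```

Storing the links as plain 2-tuples mirrors the description of the link set L; link weights and direction are deliberately left out of this sketch.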

[0026] Using unified clustering, links are more fully utilized among objects to improve clustering. The clustering of the different types of objects in the different layers is reinforced by effective clustering. If objects are clustered correctly then clustering results should be more reasonable. Clustering can provide structuralized information that is useful in analyzing data.

[0027] The framework graph 150 illustrates clustering of multiple types of objects in which the objects of each type are substantially identical in type (e.g., one type pertains to a group of Web pages, a group of users, or a group of documents, etc.). The type of each group of objects generally differs from the type of other groups of the objects within the framework graph 150.

[0028] The disclosed clustering technique considers and receives input from different (heterogeneous) object types when clustering. One aspect of this disclosure is based on an intrinsic mutual relation in which the objects being clustered are provided with links to other objects. Certain ones of the links that connect to each object (and the objects to which those links connect) can be weighted with different importance to reflect their relevance to that object. For example, objects of the same type as those being clustered can be provided with greater importance than objects of a different type. This disclosure provides a mechanism by which varying levels of importance can be assigned to different objects or different types of objects. This assigning of different levels of importance to different objects (or different types of objects) is referred to herein as clustering with importance. The varying levels of importance of the different objects often result in improved clustering results and effectiveness.

[0029] In the embodiment of the framework graph 150 for clustering heterogeneous objects as shown in FIG. 2, the different node sets P or U represent different layers each containing different object types. The multiple node sets (P and U are illustrated) of the framework graph 150 provide a basis for clustering. The two-layered directed graph 150 contains a set of data objects to be clustered. Objects of each object type (that are to be clustered according to the clustering algorithm) can be considered as instances of a “latent” class. The links 203, 204, or 205 that extend between certain ones of the object nodes reflect inherent relations among the object nodes that are provided by the clustering. An iterative projecting technique for clustering, several embodiments of which are described in this disclosure, enables separate clustering of objects that have separate data types to contribute to the clustering process.

[0030] The heterogeneous types of objects (and their associated links) are reinforced by using the iterative clustering techniques as described herein. The iterative clustering projection technique relies on obtaining clustering information from separate types of objects that are arranged in separate layers, with each layer containing a homogenous type of object. The node information in combination with the link information is used to iteratively project and propagate the clustered results (the clustering algorithm is provided between layers) until the clustering converges. Iteratively projecting the clustering results of one type of object into the clustering results of another type of object can reduce clustering challenges associated with data sparseness. With this iterative projecting, the similarity measure in one layer clustering is calculated on clusters of another type instead of on individual objects of that type.

[0031] Each of the different kinds of nodes and links is examined to obtain structural information that can be used for clustering. Structural information, for example, can be obtained by considering the type of links connecting different data objects (e.g., whether a link is an inter-layer link or an intra-layer link). The type of each object is indicated by its node set P or U, as indicated in FIG. 2.

[0032] The generalized framework graph 150 of FIG. 2 can be applied to a particular clustering application. Namely, the framework graph 150 can illustrate a group of Web pages on the Internet relative to a group of users. The Web page layer is grouped as the node set P. The user layer of objects is grouped as the node set U. The framework graph 150 integrates the plurality of Web page objects and the plurality of user objects in the representation of the two-layer framework graph 150. The framework graph 150 uses link (e.g., edge) relations 203, 204, 205 to facilitate the clustering of the different types of objects (as outlined by the generalized FIG. 2 framework graph). The link structure of the whole data set is examined during the clustering procedure to learn the different importance levels of nodes. The nodes are weighted based on their importance in the clustering procedure to ensure that important nodes are clustered more reasonably.

[0033] In certain embodiments of the present disclosure, the links 203, 204, and 205 among clusters are reserved. Reserved links are those links that extend between clusters of objects instead of the objects themselves. For example, one reserved link extends between a Web page cluster and a user cluster (instead of between a Web page object and a user object as with the original links). In certain embodiments, the reserved links are maintained for a variety of future applications, such as a recommendation in the framework graph 150. For example, the clustering result of Web page/user clustering with reserved links could be shown as a summary graph of user hit behaviors, which provides a prediction of users' hits.

[0034] The content of the respective nodes p_(i) and u_(j) is denoted by the respective vectors f_(i) and g_(j) (not shown in FIG. 2). Depending on the application, each individual node p_(i) and u_(j) may have (or may not have any) content features. Prior-art clustering techniques cluster the nodes p_(i) independently from the nodes u_(j). In contrast, in the clustering framework 150 described in this disclosure the nodes p_(i) and the nodes u_(j) are clustered dependently based on their relative importance. The clustering algorithm described herein uses a similarity function to measure distance between objects for each cluster type to produce the clustering. The cosine-similarity function as set forth in (1) can be used for clustering:

$$s_{c}(x, y) = \cos(f_{x}, f_{y}) = \frac{\sum\limits_{i=1}^{kx} f_{x}(i) \cdot \sum\limits_{j=1}^{ky} f_{y}(j)}{\sqrt{\sum\limits_{i=1}^{kx} f_{x}^{2}(i)} \cdot \sqrt{\sum\limits_{j=1}^{ky} f_{y}^{2}(j)}} \qquad (1)$$

Written in terms of the dot product, in which only the components present in both feature vectors contribute to the numerator, the similarity is:

$$s_{c}(x, y) = \cos(f_{x}, f_{y}) = \frac{f_{x} \cdot f_{y}}{\|f_{x}\| \, \|f_{y}\|} = \frac{\sum\limits_{k,\; f_{x}(k) = f_{y}(k)} f_{x}(k) \, f_{y}(k)}{\sqrt{\sum\limits_{i=1}^{kx} f_{x}^{2}(i)} \cdot \sqrt{\sum\limits_{j=1}^{ky} f_{y}^{2}(j)}} \qquad (2)$$

[0035] f_(x)·f_(y) is the dot product of the two feature vectors. It equals the sum of the products of the weights of the components that f_(x) and f_(y) have in common. s_(c) denotes that the similarity is based on content features; f_(x)(i) and f_(y)(j) are the ith and jth components of the feature vectors f_(x) and f_(y), respectively. kx is the number of items in the feature f_(x); and ky is the number of items in the feature f_(y).
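
As a non-limiting sketch of the dot-product form in (2), the content similarity could be computed over sparse feature vectors as follows. The representation of each feature vector as a Python dictionary mapping a component name to its weight, the function name `cosine_similarity`, and the choice to return 0.0 for an empty vector are assumptions made for illustration.

```python
import math


def cosine_similarity(fx, fy):
    """Cosine similarity of two sparse feature vectors given as {component: weight} dicts."""
    # Only components present in both vectors contribute to the dot product.
    shared = fx.keys() & fy.keys()
    dot = sum(fx[k] * fy[k] for k in shared)
    norm_x = math.sqrt(sum(w * w for w in fx.values()))
    norm_y = math.sqrt(sum(w * w for w in fy.values()))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0  # assumption: a node without content features is similar to nothing
    return dot / (norm_x * norm_y)


# Hypothetical term-weight features for two Web pages.
page_x = {"cluster": 1.0, "graph": 1.0}
page_y = {"cluster": 1.0}
similarity = cosine_similarity(page_x, page_y)
```

Because kx and ky may differ, the dictionary form avoids padding the two vectors to a common length.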

[0036] In this disclosure, the node set P is used as an example to illustrate the inter-layer link 204 and the intra-layer links 203 and 205 of the nodes. All data is assumed to comprise a sequence of node pairs: for intra-layer node pairs, (p⁽¹⁾, p⁽¹⁾), (p⁽²⁾, p⁽²⁾), . . . [where p⁽¹⁾ and p⁽²⁾ are the same as p_(i), and the pairs (p⁽¹⁾, p⁽¹⁾), (p⁽²⁾, p⁽²⁾) both stand for nodes in the homogeneous layer], such as connected by links 203 or 205; and for inter-layer pairs, (p⁽¹⁾, u⁽¹⁾), (p⁽²⁾, u⁽²⁾), . . . , such as connected by links 204. Thus a link between a pair of nodes (p_(i), p_(k)) or (p_(i), u_(j)) represents one or more occurrences of identical pairs in the data series. The weight of the link relates to its occurrence frequency.

[0037] In this disclosure, two separate vectors represent features of the inter-layer links 204 and the intra-layer links 203, 205 for each particular node. For example, the intra-layer link 203, 205 features are represented using a vector whose components correspond to other nodes in the same layer. By comparison, the inter-layer link 204 feature is represented using a vector whose components correspond to nodes in another layer. Each component could be a numeric value representing the weight of the link from (or to) the corresponding node. For example, the inter-layer link 204 features of nodes p₁ and p₂ (as shown in FIG. 2) can be represented as [1, 0, 0, . . . , 0]^(T) and [1, 1, 1, . . . , 0]^(T), respectively.
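
The link weights and link feature vectors described above could be derived from a sequence of node pairs along the following lines. This is a sketch only: the sample pairs, the four-user layer, and the function names `link_weights` and `link_feature_vector` are hypothetical, and the weight of a link is taken here to be exactly its occurrence count, which is one possible reading of "relates to its occurrence frequency".

```python
from collections import Counter


def link_weights(pairs):
    """Count how often each (node, node) pair occurs in the data series."""
    return Counter(pairs)


def link_feature_vector(node, layer_nodes, weights):
    """One component per node of the target layer, holding the weight of the link to it."""
    return [weights.get((node, other), 0) for other in layer_nodes]


# Hypothetical inter-layer data: p1 links to u1; p2 links to u1, u2, and u3.
pairs = [("p1", "u1"), ("p2", "u1"), ("p2", "u2"), ("p2", "u3")]
users = ["u1", "u2", "u3", "u4"]

weights = link_weights(pairs)
h_p1 = link_feature_vector("p1", users, weights)
h_p2 = link_feature_vector("p2", users, weights)
```

With these sample pairs, the two vectors have the same shape as the [1, 0, 0, . . . , 0]^(T) and [1, 1, 1, . . . , 0]^(T) example given for nodes p₁ and p₂. The same helper yields an intra-layer vector when `layer_nodes` is the node's own layer.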

[0038] Thus, the corresponding similarity function could be defined as cosine-similarity as above. The similarity function s_(l1)(x,y) for intra-layer link 203, 205 features, which determines the similarity between nodes p₁ and p₂, is described in (3) as follows:

$$s_{l1}(x, y) = \cos(l_{x}, l_{y}) = \frac{l_{x} \cdot l_{y}}{\|l_{x}\| \, \|l_{y}\|} \qquad (3)$$

[0039] By comparison, the similarity function s_(l2)(x,y) for inter-layer link 204 features determines the similarity between nodes p₁ and u₂ in (4) as follows:

$$s_{l2}(x, y) = \cos(h_{x}, h_{y}) \qquad (4)$$

[0040] where s_(l1) and s_(l2) respectively denote that the similarities are based on respective intra-layer and inter-layer link features; l_(x) and l_(y) are intra-layer link feature vectors of node x and node y; while h_(x) and h_(y) are inter-layer link feature vectors of node x and node y.

[0041] Other representations of link features and other similarity measures could be used, such as representing the links of each node as a set and applying a Jaccard coefficient. There are multiple advantages of the embodiments described herein. One advantage is that certain ones of the embodiments of clustering algorithms accommodate weighted links. Moreover, such clustering algorithms as the k-means clustering algorithm facilitate the calculation of the centroid of the clustering. The centroid is useful in further calculations to indicate a generalized value or characteristic of the clustered objects.
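
For the set-based alternative mentioned above, a Jaccard coefficient over the link sets of two nodes could be sketched as follows. The function name `jaccard`, the sample link sets, and the convention of returning 0.0 when both sets are empty are assumptions for illustration.

```python
def jaccard(links_x, links_y):
    """Jaccard coefficient: size of the intersection divided by size of the union."""
    links_x, links_y = set(links_x), set(links_y)
    union = links_x | links_y
    if not union:
        return 0.0  # assumption: two nodes with no links at all are treated as dissimilar
    return len(links_x & links_y) / len(union)


# Hypothetical inter-layer link sets: the users linked to each of two Web pages.
links_p1 = {"u1"}
links_p2 = {"u1", "u2", "u3"}
score = jaccard(links_p1, links_p2)
```

Unlike the cosine form, this set representation ignores link weights, so it does not share the weighted-link advantage noted above.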

[0042] The overall similarity function of node x and node y can be defined as the weighted sum of the three similarities using the three weighted values α, β, and γ as set forth in (5). There are two disclosed techniques to assign the three weighted values: heuristically and by training. If, for example, there is no tuning data, the weights are assigned manually to some desired value (e.g., α=0.5, β=0.25, and γ=0.25). If there is some extra tuning data, by comparison, then the weights can be calculated using a greedy algorithm, a hill-climbing algorithm, or some other type of either local or global improvement or optimizing program. A greedy algorithm refers to a type of optimization algorithm that seeks to improve each factor in each step, so that eventually an improved (and optimized in certain embodiments) solution can be reached.

$$s(x, y) = \alpha \, s_{c}(x, y) + \beta \, s_{l1}(x, y) + \gamma \, s_{l2}(x, y) \qquad (5)$$

[0043] where α+β+γ=1.
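
The weighted sum in (5) could be expressed as the short function below. The function name `overall_similarity` is hypothetical, the default weights are the manually assigned example values given above, and the check that the weights sum to one (within a small tolerance) is an added safeguard rather than something this disclosure requires.

```python
def overall_similarity(s_c, s_l1, s_l2, alpha=0.5, beta=0.25, gamma=0.25):
    """Weighted sum of content, intra-layer link, and inter-layer link similarities."""
    if abs(alpha + beta + gamma - 1.0) > 1e-9:
        raise ValueError("alpha + beta + gamma must equal 1")
    return alpha * s_c + beta * s_l1 + gamma * s_l2


# Hypothetical component similarities for one pair of nodes.
s = overall_similarity(s_c=0.8, s_l1=0.4, s_l2=0.2)
```

Setting one weight to zero simply removes the corresponding information source, for example when the nodes have no content features.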

[0044] Using these calculations, the content of the nodes, and the similarity of the nodes, are determined. Depending on the application, the three variables can be modified to provide different information values for the clustering algorithm. These contents and similarities of the nodes can thereupon be used as a basis for retrieval.

[0045] Many heterogeneous clustering problems share the property that the nodes are not equally important. Examples of heterogeneous clustering include Web page/user clustering, item/user clustering for collaborative filtering, etc. For these applications, important objects play an important role in getting more reasonable clustering results. In this disclosure, the link structure of the whole dataset is used to learn the importance of nodes. For each node in the node sets P and U, for example p_(i) and u_(j), importance weights ip_(i) and iu_(j) are calculated from the link structure and are used in the clustering procedure.

[0046] One clustering aspect relates to a link analysis algorithm, multiple embodiments of which are provided in this disclosure. In one embodiment of the link analysis algorithm, a hybrid net model 400 as shown in FIG. 3 is constructed. Using the hybrid net model 400, the users and the Web pages are used as two illustrative types of nodes. The FIG. 3 embodiment of the hybrid net model involving Web page and user types of objects is particularly directed to types of clustering involving the Internet, intranets, or other networks. The links include Web page hyperlinks/interactions as shown by link 405, user-to-Web page hyperlinks/interactions as shown by link 404, and user-to-user hyperlinks/interactions as shown by link 403. The hybrid net model 400 of FIG. 3 explicates these hyperlinks/relations by indicating the relations in and between users and Web pages that are illustrated by links 403, 404, and 405.

[0047] Given a certain group of users 408 that are contained within a user set 410, all Web pages that any of the nodes from the user set 410 have visited form the Web page set 412. The Web page set 412 is determined by sending the root Web page set to search engines and obtaining a base Web page set. Three kinds of links represented by the arrows in FIG. 3 have different meanings. Those links represented by the arrows 405 that are contained within the Web page set 412 indicate hyperlinks between Web pages. Those links represented by arrows 403 that are contained within the user set 410 indicate social relations among users. Those links represented by arrows 404 that extend between the user set 410 and the Web page set 412 indicate the users' visiting actions toward Web pages. The links represented by arrows 404 indicate the user's evaluation of each particular Web page, so the authority/hub score of a Web page will be more credible. Since the different types of links 403, 404, and 405 represent different relations, each link can be weighted with a different importance depending, for example, on how often the link is accessed or how each pair of nodes that are connected by the link are associated.

[0048] FIG. 4 illustrates one embodiment of the computer environment 20 that is configured to perform clustering using the Internet. One aspect of such clustering may involve clustering the Web pages based on users (including the associated inter-layer links and the intra-layer links). The computer environment includes a plurality of Web sites 350, a search engine 352, a server/proxy portion 354, a modeling module 356, a computing module 358, and a suggestion/reference portion 360. The computer environment 20 interfaces with the users 362 such as with a graphical user interface (GUI). The computing module 358 includes an iterative computation portion 380 that performs the clustering algorithm (certain embodiments of which rely on iterative computation). The modeling module 356 acts to collect data and track data (e.g., associated with the objects). The search engine 352 returns search results based on the user's query. The Web sites 350 represent the data as it is presented to the user. The server/proxy portion 354 communicates the queries and the like to a server that performs much of the clustering. The suggestion/reference portion 360 allows the user to modify or select the clustering algorithm.

[0049] The modeling module 356 includes a prior formalization portion 370, a Web page extraction portion 372, and a user extraction portion 374. Portions 370, 372, and 374 are configured to provide and/or track data that, respectively, has been previously formalized, is extracted from a Web page, or is extracted from the user 362. The embodiment of the computer environment as illustrated in FIG. 4 is configured to provide a link analysis algorithm, one embodiment of which is described in this disclosure.

[0050] One embodiment of the clustering algorithm can analyze a Web graph by looking for three types of nodes: hubs, authorities, and users. Hubs are pages that link to a number of other pages that provide useful relevant information on a particular topic. Authority pages are considered as pages that are relevant to many hubs. Users access each one of the authorities and hubs. Each pair of hubs, authorities, and users thereby exhibits a mutually reinforcing relationship. The clustering algorithm relies on three vectors that are used in certain embodiments of the present link analysis algorithm: the Web page authority weight vector a, the hub weight vector h, and the user vector u. Certain aspects of these vectors are described in this disclosure.

[0051] Several of the following terms relating to the following weight calculations are not illustrated in the figures such as FIG. 3, and instead relate to the calculations. In one embodiment, for a given user i, the user weight u_(i) denotes his/her knowledge level. For a Web page j, respective terms a_(j) and h_(j) indicate the authority weight and the hub weight. In one embodiment, each one of the three vectors (representing the user weight u, the Web page authority weight a, and the hub weight h) is initialized at some value (such as 1). All three vectors h, a, and u are then iteratively updated based on the Internet usage considering the following calculations as set forth respectively in (6), (7), and (8):

$$\left\{ \begin{matrix} a(p) = \sum\limits_{q \to p} h(q) + \sum\limits_{r \to p} u(r) & (6) \\ h(p) = \sum\limits_{p \to q} a(q) + \sum\limits_{r \to p} u(r) & (7) \\ u(r) = \sum\limits_{r \to p} a(p) + \sum\limits_{r \to q} h(q) & (8) \end{matrix} \right.$$

[0052] where p and q stand for specific Web pages, and r stands for a specific user. There are two kinds of links in certain embodiments of the disclosed network: the links between different pages (hyperlinks) and the links between users and pages (browsing patterns). Let A=[a_(ij)] denote the adjacency matrix of the base set for all three vectors h, a, and u. Note that a_(ij)=1 if page i links to page j, and a_(ij)=0 otherwise. V=[v_(ij)] is the visit matrix of the user set to the Web page set. Consider that v_(ij)=1 if user i visits page j, and v_(ij)=0 otherwise. Also, as set forth in (9), (10), and (11):

$$\left\{ \begin{matrix} a = A^{T} h + V^{T} u & (9) \\ h = A a + V^{T} u & (10) \\ u = V (a + h) & (11) \end{matrix} \right.$$

[0053] In a preferred embodiment, the calculations for vectors a, h, and u as set forth in (9), (10), and (11) go through several iterations to provide meaningful results. Prior to the iterations in certain embodiments, a random value is assigned to each one of the vectors a, h, and u. Following each iteration, the values of a, h, and u will be changed and normalized to provide a basis for the next iteration. Following each iteration, the iterative values of a, h, and u each tend to converge to a certain respective value. The users with high user weight u_(i) and Web pages with high authority weight a_(j) and/or hub weight h_(j) can be reported. In a preferred embodiment, certain respective user or Web page objects can be assigned higher values than other respective user or Web page objects. The higher the value is, the more importance is assigned to that object.
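
One way the iteration over (9), (10), and (11) might be carried out is sketched below in plain Python. Several details are assumptions rather than statements of this disclosure: all three vectors start at 1 (a random start is the other option mentioned), each iteration computes the new a, h, and u from the values of the previous iteration, normalization divides each vector by its Euclidean length, and a fixed iteration count stands in for a convergence test. The function names and the small sample matrices are hypothetical.

```python
import math


def normalize(vec):
    """Scale a vector to unit Euclidean length (assumed normalization)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def link_analysis(A, V, iterations=50):
    """Iterate a = A^T h + V^T u, h = A a + V^T u, u = V (a + h).

    A[i][j] = 1 if page i links to page j; V[r][p] = 1 if user r visits page p.
    """
    n_pages, n_users = len(A), len(V)
    a, h, u = [1.0] * n_pages, [1.0] * n_pages, [1.0] * n_users
    for _ in range(iterations):
        # (V^T u)_p: total weight of the users who visit page p.
        visits = [sum(V[r][p] * u[r] for r in range(n_users)) for p in range(n_pages)]
        new_a = [sum(A[q][p] * h[q] for q in range(n_pages)) + visits[p]
                 for p in range(n_pages)]
        new_h = [sum(A[p][q] * a[q] for q in range(n_pages)) + visits[p]
                 for p in range(n_pages)]
        new_u = [sum(V[r][p] * (a[p] + h[p]) for p in range(n_pages))
                 for r in range(n_users)]
        a, h, u = normalize(new_a), normalize(new_h), normalize(new_u)
    return a, h, u


# Hypothetical data: page 0 links to page 1; user 0 visits both pages, user 1 visits page 1.
A = [[0, 1],
     [0, 0]]
V = [[1, 1],
     [0, 1]]
authority, hub, user = link_analysis(A, V)
```

In this sample, page 1 receives both a hyperlink and more visits than page 0, so it ends with the larger authority weight, and the user who visits both pages ends with the larger user weight.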

[0054] The embodiment of the link analysis algorithm described in this disclosure thereby relies on iterative input from both Web pages and users. As such, weighted input from the user is applied to the clustering algorithm of the Web page. Using the weighted user input for the clustering improves the precision of the search results, and the speed at which the clustering algorithm can be performed.

[0055] While the link analysis algorithm described herein is applied to clustering algorithms for clustering Web pages based on users, it is envisioned that the link analysis algorithm can be applied to any heterogeneous clustering algorithm. This weighting partially provides for the clustering with importance as described herein.

[0056] A variety of embodiments of a clustering algorithm that can be used to cluster object types are described. Clustering algorithms attempt to find natural groups of data objects based on some similarity between the data objects to be clustered. As such, clustering algorithms perform a clustering action on the data objects. Certain embodiments of the clustering algorithm also find the centroid of a group of data sets, which is a point whose parameter values are the mean of the parameter values of all the points in the cluster. To determine cluster membership, most clustering algorithms evaluate the distance between a point and the cluster centroid. The output from a clustering algorithm is basically a statistical description of the cluster centroids with the number of components in each cluster.
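The membership test described above can be illustrated with a short sketch. The function name `nearest_centroid` is hypothetical, and squared Euclidean distance is an assumption; the disclosure does not fix a particular distance measure.

```python
def nearest_centroid(point, centroids):
    """Return the index of the centroid closest to the point.

    Squared Euclidean distance is assumed as the distance measure.
    """
    def sq_dist(centroid):
        return sum((x - y) ** 2 for x, y in zip(point, centroid))
    return min(range(len(centroids)), key=lambda i: sq_dist(centroids[i]))
```

A point is assigned to the cluster whose index is returned; repeating this for every point gives the cluster memberships for one pass.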

[0057] Multiple embodiments of cluster algorithms are described in this disclosure. The two-ways k-means cluster algorithm is based on the mutual reinforcement of the clustering process. The two-ways k-means cluster algorithm is an iterative clustering algorithm. In the two-ways k-means cluster algorithm, the object importance is first calculated by (6)-(8) or (9)-(11), and the result is then applied in the following iterative clustering procedure. The clustering algorithm clusters objects in each layer based on the defined similarity function. Although many clustering algorithms, such as k-means, k-medoids, and agglomerative hierarchical methods, could be used, this disclosure describes the application of the k-means clustering algorithm.

[0058] There are several techniques to apply the calculated importance score of nodes. One technique involves modifying the basic k-means clustering algorithm to a ‘weighted’ k-means algorithm. In the modified k-means algorithm, the centroid of the given cluster is calculated using the weighted sum of the features, with the weight setting determined by the importance score. The nodes having a higher importance or weighting are thereby given more emphasis in forming the cluster centroid for both the content and the link features. Another embodiment involves modifying the nodes' link weight by their importance score, and then using the weighted link feature in the similarity function. In this way, the importance of the nodes is only reflected in the link feature in the clustering process.
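The ‘weighted’ centroid of the first technique can be sketched as below. The function name `weighted_centroid` is hypothetical, and dividing the weighted sum by the total importance (so that the result is a weighted mean) is an assumption made here for illustration.

```python
def weighted_centroid(features, importance):
    """Importance-weighted mean of the feature vectors in one cluster.

    features:   list of feature vectors, one per node in the cluster
    importance: list of importance scores, one per node
    """
    total = sum(importance)
    if total <= 0:
        raise ValueError("importance scores must sum to a positive value")
    dims = len(features[0])
    return [
        sum(w * f[d] for f, w in zip(features, importance)) / total
        for d in range(dims)
    ]
```

For two nodes with features [0.0, 0.0] and [4.0, 2.0] and importance scores 1.0 and 3.0, the centroid is [3.0, 1.5], that is, pulled toward the more important node; with equal scores it reduces to the ordinary mean [2.0, 1.0].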

[0059] One embodiment of the input/output of the clustering algorithm is shown in FIGS. 5a and 5b. The input to the clustering algorithm includes a two-layered framework graph 150 (including the content features f_(i) and g_(j) of the nodes). The output of the clustering algorithm includes a new framework graph 150 that reflects the clustering. In certain embodiments of the new framework graph, the variations of each old node that has changed into its new node position can be illustrated.

[0060] A flow chart illustrating one embodiment of the clustering algorithm 450 is shown in FIGS. 5a and 5b. The clustering algorithm 450 includes 451, in which the original framework graph (prior to each clustering iteration) is input. In 452, the importance of each node being considered is determined or calculated using (6)-(8) or (9)-(11). In 454, an arbitrary layer is selected for clustering. Nodes in the selected layer are clustered in an appropriate fashion (e.g., according to content features) in 455. In certain embodiments, the nodes can be filtered using a desired filtering algorithm (not shown) to improve the clustering. In 456, the nodes of each cluster are merged into one node. For instance, if two candidate nodes exist following the filtering, the closest two candidate nodes can be merged by, e.g., averaging the vector values of the two candidate nodes. This merging allows individual nodes to be combined to reduce the number of nodes that have to be considered. As such, the merging operation can be used to reduce the occurrence of duplicates and near-duplicates.

[0061] The corresponding links are updated based on the merging in 457. In 458, the clustering algorithm switches to a second layer (from the arbitrarily selected layer) for clustering. In 460, the nodes of the second layer are clustered according to their content features and updated link features. In 461, the nodes of each cluster are merged into one node.

[0062] In 462, the original link structure and the original nodes of the other layer are restored. In 464, the nodes of each cluster of the second layer are merged, and the corresponding links are updated. In 466, this iterative clustering process is continued within the computer environment. In 468, a revised version of the framework graph 150 is output.
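The merging in 456 and the link update in 457 can be sketched as follows. This is an illustrative sketch under stated assumptions: the function name `merge_clusters` and the dictionary layout are hypothetical, merged nodes take the average of their members' feature vectors (as in the averaging example of paragraph [0060]), and the weights of links that collapse onto the same merged node are assumed to be summed.

```python
from collections import defaultdict

def merge_clusters(features, links, assignment):
    """Merge the nodes of each cluster into one node and update inter-layer links.

    features:   {node_id: [feature values]} for the layer being clustered
    links:      {(node_id, other_id): weight} from this layer to the other layer
    assignment: {node_id: cluster_id} produced by the clustering step
    """
    members = defaultdict(list)
    for node, cluster in assignment.items():
        members[cluster].append(node)

    # One merged node per cluster: average the member feature vectors.
    merged_features = {}
    for cluster, nodes in members.items():
        dims = len(features[nodes[0]])
        merged_features[cluster] = [
            sum(features[n][d] for n in nodes) / len(nodes) for d in range(dims)
        ]

    # Re-point each link at the merged node; summing the weights is an assumption.
    merged_links = defaultdict(float)
    for (node, other), weight in links.items():
        merged_links[(assignment[node], other)] += weight
    return merged_features, dict(merged_links)
```

The second layer can then be clustered against the merged nodes, which gives it denser link features than the original, unmerged nodes would.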

[0063] In the initial clustering pass, only the content features are utilized, because in most cases the link features are too sparse in the beginning to be useful for clustering. In subsequent clustering passes, content features and link features are combined to enhance the effectiveness of the clustering. When the content features and the link features are combined, their weights can be specified with different values and the results compared, so that clustering having an improved accuracy can be provided.

[0064] The clustering algorithm as described relative to FIGS. 5a and 5b can be applied to many clustering embodiments. More particularly, one embodiment of clustering of Web pages based on how the Web pages are accessed by users is now described. Where a link extends between a node of the user layer and a node of the Web page layer, a user u_(j) has visited a Web page p_(i) before if there is a link from u_(j) to p_(i). The weight of the link represents the probability that the user u_(j) will visit the page p_(i) at a specific time, denoted as Pr(p_(i)|u_(j)). It can be calculated simply by counting the numbers within the observed data, as shown in (12). $\begin{matrix} \Pr\left( p_{i} \mid u_{j} \right) = \frac{C\left( p_{i},u_{j} \right)}{\sum\limits_{t \in P(u_{j})} C\left( p_{t},u_{j} \right)} & (12) \end{matrix}$

[0065] where P(u_(j)) is the set of pages that have been visited by the user u_(j) before, and C(p_(i),u_(j)) stands for the number of times that the user u_(j) has visited page p_(i) before.
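The count-based estimate in (12) can be illustrated with a short sketch. The function name `visit_probabilities` and the log format (a list of (user, page) visit records) are assumptions made for illustration.

```python
from collections import Counter

def visit_probabilities(log):
    """Estimate Pr(page | user) from a list of (user, page) visit records."""
    counts = Counter(log)                      # C(p_i, u_j): visits of user j to page i
    totals = Counter(user for user, _ in log)  # all visits by user j (denominator of (12))
    return {(page, user): c / totals[user] for (user, page), c in counts.items()}
```

For example, a user with two recorded visits to one page and one visit to another receives link weights of 2/3 and 1/3 to those pages.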

[0066] One embodiment of the clustering algorithm, as shown in the embodiment of framework graph 150 of FIG. 6, involves a concept layer or hidden layer. In FIG. 6, for simplicity, the intra-layer links 203 and 205 that are shown in the framework graph of FIG. 2 are hidden. It is envisioned, however, that the embodiment of framework graph 150 as shown in FIG. 6 can rely on any combination of intra-layer links and inter-layer links and still remain within the concepts of the present disclosure.

[0067] The hidden layer 670 (in the embodiment of framework graph 150 as displayed in FIG. 6) lies between the Web page layer and the user layer. The hidden layer 670 provides an additional layer of abstraction (from which links extend to each of the node sets P and U) that permits modeling with improved realism compared to extending links between the original node sets P and U. One of the inter-layer links 204 of the embodiment of framework graph 150 such as shown in FIG. 2 (that does not have a hidden layer) may be modeled as a pair of hidden inter-layer links of the embodiment of framework graph 150 such as shown in FIG. 6. One of the hidden inter-layer links extends between the Web page layer containing the node set P and the hidden layer 670, and one of the hidden inter-layer links extends between the user layer and the hidden layer 670. The direction of the arrows on each hidden inter-layer link shown in FIG. 6 is arbitrary, as are the particular Web pages and users in the respective node sets P and U that are connected by a hidden inter-layer link to a node in the hidden layer.

[0068] Links (i.e., hidden inter-layer links) that extend between the Web page layer containing the node set P and the hidden layer 670 indicate how likely it is that a Web page p₁, p₂, etc. belongs to a particular concept node P(c₁), P(c₂), etc. in the hidden layer 670. Links (i.e., hidden inter-layer links) that extend between the user layer and the hidden layer 670 indicate how likely it is that a user node u₁, u₂, etc. has interest in a particular concept node P(c₁), P(c₂), etc. within the hidden layer 670.

[0069] The links that extend between the Web page layer and the concept layer therefore each stand for the probability that a Web page p_(i) is classified into a concept category c_(k), denoted as Pr(p_(i)|c_(k)). This model embodied by the framework graph shares the assumption used by Naïve Bayesian classification, in which different words are considered conditionally independent. The concept c_(k) can therefore be represented as a normal distribution, i.e., a vector $\vec{\mu}_{k}$ for expectation and a vector $\vec{\sigma}_{k}$ for covariance. The value Pr(p_(i)|c_(k)) can be derived as per (13). $\begin{matrix} \begin{matrix} E\left( \Pr\left( p_{i} \mid c_{k} \right) \right) = \frac{\Pr\left( p_{i} \mid c_{k} \right)}{\sum\limits_{t} \Pr\left( p_{t} \mid c_{k} \right)} \\ = \frac{\prod\limits_{l} \Pr\left( w_{l,i} \mid c_{k} \right)}{\sum\limits_{t} \prod\limits_{l} \Pr\left( w_{l,t} \mid c_{k} \right)} \\ = \frac{e^{- \sum\limits_{l} \frac{1}{2\sigma_{l,k}} \left( w_{l,i} - \mu_{l,k} \right)^{2}}}{\sum\limits_{t} e^{- \sum\limits_{l} \frac{1}{2\sigma_{l,k}} \left( w_{l,t} - \mu_{l,k} \right)^{2}}} \end{matrix} & (13) \end{matrix}$

[0070] where w_(l,i) is the weight of web page p_(i) on the lth word.
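The last line of (13) can be sketched for a single concept as follows. The function name `page_concept_probabilities` is hypothetical; subtracting the largest exponent before exponentiating is an implementation choice added here for numerical stability and does not change the ratio in (13).

```python
import math

def page_concept_probabilities(pages, mu, sigma):
    """Normalized Pr(p_i | c_k) over all pages for one concept, following (13).

    pages: list of word-weight vectors, pages[i][l] = w_(l,i)
    mu:    expectation vector of the concept, mu[l] = mu_(l,k)
    sigma: covariance vector of the concept, sigma[l] = sigma_(l,k), all positive
    """
    exponents = [
        -sum((w - m) ** 2 / (2.0 * s) for w, m, s in zip(page, mu, sigma))
        for page in pages
    ]
    peak = max(exponents)  # shift by the maximum so exp() cannot underflow for every page
    scores = [math.exp(e - peak) for e in exponents]
    total = sum(scores)
    return [s / total for s in scores]
```

Pages whose word weights lie closer to the concept's expectation vector receive a larger share of the probability, and the values sum to 1 over the pages.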

[0071] Those links (denoted as Pr(c_(k)|u_(j))) that extend between a node in the user layer and a node in the hidden layer reflect the interest of the user in the category reflected by the concept. Thus, one vector (I_(j1),I_(j2), . . . ,I_(jn)), I_(jk)=Pr(c_(k)|u_(j)) corresponds to each user, in which n is the number of hidden concepts. The links shown in FIG. 6 can be considered as the vector models of the user. The vector is constrained by the user's usage data as set forth in (14). $\begin{matrix} \begin{matrix} \Pr\left( p_{i} \mid u_{j} \right) = \sum\limits_{l} \Pr\left( p_{i} \mid c_{l},u_{j} \right) \Pr\left( c_{l} \mid u_{j} \right) \\ \approx \sum\limits_{l} \Pr\left( p_{i} \mid c_{l} \right) \Pr\left( c_{l} \mid u_{j} \right) \end{matrix} & (14) \end{matrix}$

[0072] Thus, the value Pr(c_(k)|u_(j)) can be obtained by finding the solution from (14).

[0073] To simplify, let Pr(p_(i)|u_(j))=R_(i,j), Pr(p_(i)|c_(k))=S_(i,k), and Pr(c_(k)|u_(j))=T_(k,j). The user j can be considered separately as set forth in (15). $\begin{matrix} \begin{bmatrix} R_{1,j} \\ R_{2,j} \\ \vdots \\ R_{|\mathrm{Page}|,j} \end{bmatrix} = \begin{bmatrix} S_{1,1} & S_{1,2} & \cdots & S_{1,|\mathrm{Concept}|} \\ S_{2,1} & S_{2,2} & & \\ \vdots & & \ddots & \\ S_{|\mathrm{Page}|,1} & & \cdots & S_{|\mathrm{Page}|,|\mathrm{Concept}|} \end{bmatrix} \times \begin{bmatrix} T_{1,j} \\ T_{2,j} \\ \vdots \\ T_{|\mathrm{Concept}|,j} \end{bmatrix} & (15) \end{matrix}$

[0074] where “|Page|” is the total number of the Web pages, and “|Concept|” is the total number of the hidden concepts. Since |Page|>>|Concept|, a least-squares solution of T_(k,j) can be obtained using (15), or alternatively (16). $\begin{matrix} \begin{bmatrix} R_{i,1} & R_{i,2} & \cdots & R_{i,|\mathrm{User}|} \end{bmatrix} = \begin{bmatrix} S_{i,1} & S_{i,2} & \cdots & S_{i,|\mathrm{Concept}|} \end{bmatrix} \times \begin{bmatrix} T_{1,1} & T_{1,2} & \cdots & T_{1,|\mathrm{User}|} \\ T_{2,1} & T_{2,2} & & \\ \vdots & & \ddots & \\ T_{|\mathrm{Concept}|,1} & & & T_{|\mathrm{Concept}|,|\mathrm{User}|} \end{bmatrix} & (16) \end{matrix}$

[0075] where “|User|” is the total number of the users.
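A least-squares solution of the overdetermined system in (15) for one user j can be sketched as below. This is an illustration only: the function names `solve_linear` and `least_squares_interest` are hypothetical, the solution is obtained through the normal equations (SᵀS)t = Sᵀr, which is one of several possible ways to compute it, and no constraint forcing the result to be a valid probability vector is applied.

```python
def solve_linear(M, b):
    """Solve M x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [list(row) + [rhs] for row, rhs in zip(M, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("normal equations are singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - tail) / aug[row][row]
    return x

def least_squares_interest(S, r_j):
    """Least-squares T[:, j] for R[:, j] = S x T[:, j], as in (15).

    S:   |Page| x |Concept| matrix, S[i][k] = Pr(p_i | c_k)
    r_j: length-|Page| vector, r_j[i] = Pr(p_i | u_j)
    """
    n_concepts = len(S[0])
    StS = [[sum(row[c1] * row[c2] for row in S) for c2 in range(n_concepts)]
           for c1 in range(n_concepts)]
    Str = [sum(row[c1] * r for row, r in zip(S, r_j)) for c1 in range(n_concepts)]
    return solve_linear(StS, Str)
```

The same routine can be applied to (16) by transposing the roles of the matrices, so that each row of S is fitted against the corresponding row of R with T held fixed.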

[0076] Since |User|>>|Concept|, we can also give a least-squares solution of S_(i,k) as set forth in (17). $\begin{matrix} \vec{\mu}_{k} = \sum\limits_{t} \vec{P}_{t} \Pr\left( p_{t} \mid c_{k} \right) = \sum\limits_{t} S_{t,k} \vec{P}_{t} & (17) \end{matrix}$

[0077] After the vector for expectation $\vec{\mu}_{k}$ is obtained, a new vector for covariance $\vec{\sigma}_{k}$ can be calculated. While the embodiment of framework graph 150 that is illustrated in FIG. 6 extends between the node set P and the node set U, it is envisioned that the particular contents of the node sets are illustrative in nature, and that the framework can be applied to any set of node sets.
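The re-estimation of the concept parameters can be sketched as follows. The expectation follows the weighted sum in (17); the disclosure does not give a formula for the covariance vector, so the per-word weighted variance used here is an assumption, as is the hypothetical function name `reestimate_concept`.

```python
def reestimate_concept(pages, s_k):
    """Re-estimate the expectation (17) and a per-word variance for one concept.

    pages: list of page word-weight vectors P_t
    s_k:   list of S[t][k] = Pr(p_t | c_k), assumed to sum to 1 over the pages t
    """
    dims = len(pages[0])
    # Expectation: sum over pages of S[t][k] * P_t, as in (17).
    mu = [sum(s * page[d] for page, s in zip(pages, s_k)) for d in range(dims)]
    # Assumed covariance update: weighted squared deviation from the new expectation.
    sigma = [
        sum(s * (page[d] - mu[d]) ** 2 for page, s in zip(pages, s_k))
        for d in range(dims)
    ]
    return mu, sigma
```

In practice a small positive floor on each variance entry may be needed before the values are reused, since a zero entry would make the exponent in (13) undefined.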

[0078] One embodiment of the clustering algorithm in which Web page objects are clustered based on user objects can be outlined as follows, with reference to one embodiment of the Web page clustering algorithm shown as 600 in FIG. 7:

[0079] 1. Collect a group of users' logs as shown in 602.

[0080] 2. Calculate the probability that the user u_(j) will visit the Web page p_(i) at a specific time, Pr(p_(i)|u_(j)), as set forth by (12) and 604 in FIG. 7.

[0081] 3. Define the number |Concept| of nodes for the hidden concept layer (670 as shown in FIG. 6) in 606 of FIG. 7, and randomly assign the initial parameters for the vector for expectation $\vec{\mu}_{k}$ and the initial vector for covariance $\vec{\sigma}_{k}$ in 608 of FIG. 7.

[0082] 4. Calculate a Pr(p_(i)|c_(k)) value, which represents the probability that a Web page p_(i) is classified into a concept category c_(k), as set forth in (13) and 610 in FIG. 7.

[0083] 5. Calculate Pr(c_(k)|u_(j)), which represents the user's interest reflected by the links between a user node and a hidden layer node, and which can be derived by (15) as shown in 612 in FIG. 7.

[0084] 6. Update the Pr(p_(i)|c_(k)) probability that a Web page is classified into a concept category as determined in the outline step 4 by solving (13) as shown in 614 of FIG. 7.

[0085] 7. Re-estimate the parameters for each hidden concept node by using Pr(p_(i)|c_(k)) as set forth in (13) as described relative to 616 in FIG. 7.

[0086] 8. Go through (13) and (15) for several iterations to provide some basis for the values of the node sets (or at least until the model displays stable node set vector results).

[0087] FIG. 8 illustrates an example of a suitable computer environment or network 500 that includes a user interface which can provide a clustering algorithm and/or a framework graph. Similar resources may use the computer environment and the processes as described herein.

[0088] The computer environment 100 illustrated in FIG. 8 is a general computer environment, which can be used to implement the concept network techniques described herein. The computer environment can be considered as an embodiment of the embodiments of the computer embodiment 20 described above relative to FIGS. 1 and 4. The computer environment 100 is only one example of a computer environment and is not intended to suggest any limitation as to the scope of use or functionality of the computer and network architectures. Neither should the computer environment 100 be interpreted as having any dependency or requirement relating to any one or combination of components illustrated in the exemplary computer environment 100.

[0089] The computer environment 100 includes a general-purpose computing device in the form of a computer 502. The computer 502 can be, for example, one or more of a stand-alone computer, a networked computer, a mainframe computer, a PDA, a telephone, a microcomputer or microprocessor, or any other computer device that uses a processor in combination with a memory. The components of the computer 502 can include, but are not limited to, one or more processors or processing units 504 (optionally including a cryptographic processor or co-processor), a system memory 506, and a system bus 508 that couples various system components including the processor 504 and the system memory 506.

[0090] The system bus 508 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, such architectures can include an Industry Standard Architecture (ISA) bus, a Micro Channel Architecture (MCA) bus, an Enhanced ISA (EISA) bus, a Video Electronics Standards Association (VESA) local bus, and a Peripheral Component Interconnects (PCI) bus also known as a Mezzanine bus.

[0091] The computer 502 can include a variety of computer readable media. Such media can be any available media that is accessible by the computer 502 and includes both volatile and non-volatile media, and removable and non-removable media.

[0092] The system memory 506 includes the computer readable media in the form of non-volatile memory such as read only memory (ROM) 512, and/or volatile memory such as random access memory (RAM) 510. A basic input/output system (BIOS) 514, containing the basic routines that help to transfer information between elements within the computer 502, such as during start-up, is stored in the ROM 512. The RAM 510 can contain data and/or program modules that are immediately accessible to, and/or presently operated on, by the processing unit 504.

[0093] The computer 502 may also include other removable/non-removable, volatile/non-volatile computer storage media. By way of example, FIG. 9 illustrates a hard disk drive 515 for reading from and writing to a non-removable, non-volatile magnetic media (not shown), a magnetic disk drive 518 for reading from and writing to a removable, non-volatile magnetic disk 520 (e.g., a “floppy disk”), and an optical disk drive 522 for reading from and/or writing to a removable, non-volatile optical disk 524 such as a CD-ROM, DVD-ROM, or other optical media. The hard disk drive 515, magnetic disk drive 518, and optical disk drive 522 are each connected to the system bus 508 by one or more data media interfaces 527. Alternatively, the hard disk drive 515, magnetic disk drive 518, and optical disk drive 522 can be connected to the system bus 508 by one or more interfaces (not shown).

[0094] The disk drives and their associated computer-readable media provide non-volatile storage of computer readable instructions, control node data structures, program modules, and other data for the computer 502. Although the example illustrates a hard disk within the hard disk drive 515, a removable magnetic disk 520, and a non-volatile optical disk 524, it is to be appreciated that other types of the computer readable media which can store data that is accessible by a computer, such as magnetic cassettes or other magnetic storage devices, flash memory cards, CD-ROM, digital versatile disks (DVD) or other optical storage, random access memories (RAM), read only memories (ROM), electrically erasable programmable read-only memory (EEPROM), and the like, can also be utilized to implement the exemplary computer environment 100.

[0095] Any number of program modules can be stored on the hard disk contained in the hard disk drive 515, magnetic disk 520, non-volatile optical disk 524, ROM 512, and/or RAM 510, including by way of example, the OS 526, one or more application programs 528, other program modules 530, and program data 532. Each OS 526, one or more application programs 528, other program modules 530, and program data 532 (or some combination thereof) may implement all or part of the resident components that support the distributed file system.

[0096] A user can enter commands and information into the computer 502 via input devices such as a keyboard 534 and a pointing device 536 (e.g., a “mouse”). Other input devices 538 (not shown specifically) may include a microphone, joystick, game pad, satellite dish, serial port, scanner, and/or the like. These and other input devices are connected to the processing unit 504 via input/output interfaces 540 that are coupled to the system bus 508, but may be connected by other interface and bus structures, such as a parallel port, game port, or a universal serial bus (USB).

[0097] A monitor, flat panel display, or other type of computer display 200 can also be connected to the system bus 508 via an interface, such as a video adapter 544. In addition to the computer display 200, other output peripheral devices can include components such as speakers (not shown) and a printer 546 which can be connected to the computer 502 via the input/output interfaces 540.

[0098] The computer 502 can operate in a networked environment using logical connections to one or more remote computers, such as a remote computer device 548. By way of example, the remote computer device 548 can be a personal computer, portable computer, a server, a router, a network computer, a peer device or other common network node, game console, and the like. The remote computer device 548 is illustrated as a portable computer that can include many or all of the elements and features described herein relative to the computer 502.

[0099] Logical connections between the computer 502 and the remote computer device 548 are depicted as a local area network (LAN) 550 and a general wide area network (WAN) 552. Such networking environments are commonplace in offices, enterprise-wide computer networks, intranets, and the Internet.

[0100] When implemented in a LAN networking environment, the computer 502 is connected to a local network 550 via a network interface or adapter 554. When implemented in a WAN networking environment, the computer 502 can include a modem 556 or other means for establishing communications over the wide network 552. The modem 556, which can be internal or external to the computer 502, can be connected to the system bus 508 via the input/output interfaces 540 or other appropriate mechanisms. It is to be appreciated that the illustrated network connections are exemplary and that other means of establishing communication link(s) between the computers 502 and 548 can be employed.

[0101] In a networked environment, such as that illustrated with the computer environment 100, program modules depicted relative to the computer 502, or portions thereof, may be stored in a remote memory storage device. By way of example, remote application programs 558 reside on a memory device of the remote computer 548. For purposes of illustration, application programs and other executable program components such as the operating system are illustrated herein as discrete Web blocks, although it is recognized that such programs and components reside at various times in different storage components of the computer 502, and are executed by the data processor(s) of the computer 502. It will be appreciated that the network connections shown and described are exemplary and other means of establishing a communications link between the computers may be used.

[0102] Various modules and techniques may be described herein in the general context of the computer-executable instructions, such as program modules, executed by one or more computers or other devices. Generally, program modules include routines, programs, control objects 650, components, control node data structures 654, etc. that perform particular tasks or implement particular abstract data types. Often, the functionality of the program modules may be combined or distributed as desired in various embodiments.

[0103] An implementation of these modules and techniques may be stored on or transmitted across some form of the computer readable media. Computer readable media can be any available media that can be accessed by a computer. By way of example, and not limitation, computer readable media may comprise “computer storage media” and “communications media.”

[0104] “Computer storage media” includes volatile and non-volatile, removable and non-removable media implemented in any process or technology for storage of information such as computer readable instructions, control node data structures, program modules, or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer.

[0105] The term “communication media” includes, but is not limited to, computer readable instructions, control node data structures, program modules, or other data in a modulated data signal, such as a carrier wave or other transport mechanism. Communication media also includes any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared, and other wireless media. Combinations of any of the above are also included within the scope of computer readable media.

[0106] Although systems, media, methods, approaches, processes, etc. have been described in language specific to structural and functional features and/or methods, it is to be understood that the invention defined in the appended claims is not necessarily limited to the specific features or methods described. Rather, the specific features and methods are disclosed as exemplary forms of implementing the claimed invention.

1. A method comprising: selecting a group of objects of a first object type and a group of objects of a second object type based on a relative importance of certain objects of the first object type and certain objects of the second object type.
2. The method of claim 1, wherein the first object type is a Web page object type.
3. The method of claim 1, wherein the second object type is a user object type.
4. The method of claim 1, wherein the clustering further comprises arranging a framework graph that includes a first layer corresponding to the first object type and a second layer corresponding to the second object type.
5. The method of claim 4, further comprising connecting a first set of links within the framework between certain objects of the first object type and certain objects of the second object type, connecting a second set of links within the framework between certain objects of the first object type and certain objects of the first object type, and connecting a third set of links within the framework between certain objects of the second object type and certain objects of the second object type.
6. The method of claim 5, wherein each link of the first set of links is an inter-layer link.
7. The method of claim 5, wherein each link of the second set of links is an intra-layer link.
8. The method of claim 5, wherein each link of the third set of links is an intra-layer link.
9. The method of claim 1, wherein the first object type and the second object type are heterogeneous objects.
10. The method of claim 9, wherein the importance of the heterogeneous objects is reinforced by iterative clustering.
11. The method of claim 1, wherein the first object type is homogeneous.
12. The method of claim 1, wherein the second object type is homogeneous.
13. The method of claim 1, wherein the clustering the group of objects relies on a clustering algorithm.
14. An apparatus, comprising: a framework graph including: a first node set and a second node set, wherein both the first node set and the second node set store a plurality of objects, and wherein each one of the plurality of objects stored in the first node set is of a similar object type, and wherein each one of the objects stored in the second node set is of a similar object type, at least one inter-layer link that connects an object of the first node set with an object of the second node set; and a clustering mechanism that clusters objects of the first node set based on weighted values of the at least one inter-layer link.
15. The apparatus of claim 14, wherein the first node set includes a set of Web page objects.
16. The apparatus of claim 14, wherein the second node set includes a set of user objects.
17. The apparatus of claim 14, wherein the objects stored in the first node set are homogenous.
18. The apparatus of claim 14, wherein the objects stored in the second node set are homogenous.
19. The apparatus of claim 14, wherein the objects stored in the first node set are heterogenous compared to the objects stored in the second node set.
20. The apparatus of claim 14, wherein the clustering mechanism is based on a clustering algorithm.
21. The apparatus of claim 20, wherein the clustering algorithm is iterative.
22. The apparatus of claim 14, wherein the at least one first intra-layer link defines the relation between the connected one of the different objects in the first node set.
23. The apparatus of claim 14, wherein the at least one second intra-layer link defines the relation between the connected ones of the different objects in the second node set.
24. The apparatus of claim 14, wherein the at least one inter-layer link defines the association between the connected object of the first node set and the connected object of the second node set.
25. The apparatus of claim 14, further comprising: at least one first intra-layer link that connects different objects of the first node set; at least one second intra-layer link that connects different objects of the second node set; and wherein the clustering mechanism clusters objects of the first node set based on weighted values of the at least one first intra-layer link and the at least one second intra-layer link.
26. The apparatus of claim 14, wherein the at least one inter-layer link further comprising a hidden layer, wherein the intra-layer link includes a first hidden inter-layer link that extends between the first node set and the hidden layer, and wherein the intra-layer link includes a second hidden inter-layer link that extends between the second node set and the hidden layer.
27. The apparatus of claim 26, wherein weights of the first hidden inter-layer link and the second hidden inter-layer link are iteratively modified.
 28. A method, comprising: iteratively clustering,including separately clustering a first group of data objects and asecond group of data objects, both the first group of data objects andthe second group of data objects have separate data types; and analyzingcertain links from the first group of data objects and the second groupof data objects, including analyzing inter-layer links between certainobjects of the first group of data objects and certain objects from thesecond group of data objects.
 29. The method of claim 28, furthercomprising analyzing intra-layer links between certain objects of thefirst group of data objects and certain other objects from the firstgroup of data objects.
 30. The method of claim 29, further comprisinganalyzing intra-layer links between certain objects of the second groupof data objects and certain other objects from the second group of dataobjects.
 31. A computer readable medium having computer executableinstructions for generating a concept network, which when executed by aprocessor, causes the processor to: iteratively cluster, includingseparately clustering a group of Web page data objects and a group ofuser data objects, both the group of Web page data objects and the groupof user data objects have separate data types; and analyze certain linksfrom the group of Web page data objects and the group of user dataobjects, including analyze inter-layer links between certain objects ofthe group of Web page data objects and certain objects from the group ofuser data objects.
 32. The computer readable medium having computer executable instructions of claim 31, wherein analyzing the certain links includes analyzing intra-layer links between certain objects of the group of Web page data objects and certain other objects from the group of Web page data objects.
 33. The computer readable medium having computer executable instructions of claim 31, wherein analyzing the certain links includes analyzing intra-layer links between certain objects of the group of user data objects and certain other objects from the group of user data objects.
 34. A method for clustering, comprising: inputting a framework graph; calculating importance of a plurality of object nodes; selecting an arbitrary layer for clustering; clustering nodes in the selected layer using a clustering algorithm; merging each clustered set of nodes into one node; updating associated links based on the merged node; selecting a second layer that differs from the arbitrary layer; clustering nodes in the second layer; and updating the framework graph into a new framework graph based on the clustered nodes in the second layer and the clustered nodes in the arbitrary layer.
 35. A clustering method for clustering a group of objects, comprising: structuring the objects into a plurality of layers; establishing at least some links between certain ones of the objects; evaluating the overall link structure for certain objects including links that connect to other objects that are in the same layer as the evaluated object, and other objects that are not in the same layer as the evaluated object, wherein such link structure may contain different importance information reflecting the popularity of the other objects to each object being evaluated.
 36. The clustering method of claim 35, wherein the evaluated object is a Web page object.
 37. The clustering method of claim 35, where certain of the other objects are Web page objects.
 38. The clustering method of claim 35, where certain of the other objects are user objects.
 39. A method comprising: collecting a group of users' logs; calculating the probability that the user will visit the Web page at a specific time; determining the concept for the hidden concept layer; randomly assigning initial parameters for nodes in the hidden concept layer; calculating a value representing the probability that a Web page is classified into a concept category; and calculating the user's interest in the links between a user node and a hidden layer node.
 40. The method of claim 39, further comprising: updating the probability that a Web page is classified into a concept category; and re-estimating the parameters for each hidden concept node.
 41. The method of claim 40, wherein the calculated initial parameters relate to an expectation vector.
 42. The method of claim 40, wherein the calculated initial parameter relates to a covariance vector.
 43. A method comprising: collecting a group of logs of a first object type; calculating the probability that the first object type will interface with the second object type at a specific time; determining the concept for a hidden concept layer; randomly assigning initial parameters for nodes in the hidden concept layer; calculating a value representing the probability that the second object type is classified into a concept category; and calculating the user's interest in the links between a first object type node and a hidden layer node.
 44. The method of claim 43, further comprising: updating the probability that a node of the second object type is classified into a concept category; and re-estimating the parameters for each hidden concept node.